3rd Year PhD student in the Dept. of Biostatistics
Interests include: Cancer Biostatistics, Clinical Trials, Surrogate Endpoints
Hobbies: Cooking, Hiking, Volleyball, Traveling
Statistical models are ubiquitous in public health research
Communicating results to stakeholders is an important part of the modeling process
Visualizations can often present results in a manner that is more digestible than tables or other summaries
Reference: Landre et al (2024)
What information is easily accessible from the forest plot that is difficult to ascertain from the table?
Results from a multiple linear regression exploring the impact of country-level variables on average life expectancy.
What is the first thing that stands out to you in this plot?
Which of Tufte’s principles of graphical excellence does this violate?
Focus was on visualization tools that display data in its raw form
boxplots, histograms show distribution of observed data
scatter plots show relationship between two continuous variables
How can we use the color, shapes, and other aesthetics to best convey information contained within the data?
Focus on visualization tools that display results from data analysis.
parameter estimates (i.e. regression coefficients)
uncertainty in estimation (i.e. confidence bands)
predicted values
How can we use the color, shapes, and other aesthetics to best convey the results of statistical modeling?
#Filter out observations from the Versicolor Species
iris <- iris %>% filter(Species != "versicolor") %>%
mutate(Species = as.factor(Species))
#Plot sepal width by sepal length colored by species
ggplot(iris, aes(x = Sepal.Length, y=Sepal.Width, color = Species, shape=Species)) +
geom_point(size = 2.5) +
xlab("Sepal Length (L)") +
ylab("Sepal Width (W) ") +
theme(plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 18),
axis.text = element_text(size = 15),
legend.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.key.size = unit(1, "cm"))\[ W_i = \beta_0 + \beta_1L_i + \beta_2\text{Setosa}_i + \beta_3L_i\text{Setosa}_i + \epsilon_i \]
When Species = Setosa, \(W_i = (\beta_0 + \beta_2) + (\beta_1+\beta_3)L_i + \epsilon_i\)
When Species = Virginica, \(W_i = \beta_0 + \beta_1L_i + \epsilon_i\)
When Species = Setosa, \(W_i = (\beta_0 + \beta_2) + (\beta_1+\beta_3)L_i + \epsilon_i\)
When Species = Virginica, \(W_i = \beta_0 + \beta_1L_i + \epsilon_i\)
\(\beta_3 = 0.5666 > 0\), so the slope of the regression line for Setosa is greater than the slope for Virginica.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| Intercept (B0) | 1.4463 | 0.4069 | 3.5546 | 0.0006 |
| Sepal Length (B1) | 0.2319 | 0.0615 | 3.7717 | 0.0003 |
| Setosa (B2) | -2.0157 | 0.6894 | -2.9238 | 0.0043 |
| Sepal Length:Setosa (B3) | 0.5666 | 0.1268 | 4.4684 | 0.0000 |
#Set the reference level of Species to back to setosa for plotting
iris$Species <- relevel(iris$Species, ref = "setosa")
#Plot sepal width by sepal length colored by species
#Add regression lines with confidence intervals
ggplot(iris, aes(x = Sepal.Length, y=Sepal.Width, color = Species, shape=Species)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = TRUE, fullrange = TRUE) +
xlab("Sepal Length") +
ylab("Sepal Width") +
theme(plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 18),
axis.text = element_text(size = 15),
legend.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.key.size = unit(1, "cm"))It is best not to extrapolate the model fit
#Plot sepal width by sepal length colored by species
#Add regression lines with confidence intervals
#Restrict confidence bands to range of observed data
ggplot(iris, aes(x = Sepal.Length, y=Sepal.Width, color = Species, shape=Species)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = TRUE, fullrange = FALSE) +
xlab("Sepal Length") +
ylab("Sepal Width") +
theme(plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 18),
axis.text = element_text(size = 15),
legend.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.key.size = unit(1, "cm"))mod <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris)
predictions <- predict(mod, interval = "prediction")
iris_2 <- cbind(iris, predictions)
ggplot(iris_2, aes(x = Sepal.Length, y=Sepal.Width, color = Species, shape=Species)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = TRUE, fullrange = FALSE) +
geom_line(aes(y = lwr)) +
geom_line(aes(y = upr)) +
xlab("Sepal Length") +
ylab("Sepal Width") +
theme(axis.title = element_text(size = 18),
axis.text = element_text(size = 15),
legend.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.key.size = unit(1, "cm"))Data were collected as part of an observational study of stroke patients.
explored the association between psychological factors (depression, optimism, fatalism, and spirituality) and stoke outcomes
we will consider the outcome variable Depression as it relates to other psychological and demographic factors.
Morgenstern, L. B., et al (2011). Fatalism, optimism, spirituality, depressive symptoms, and stroke outcome: a population-based analysis. Stroke, 42(12), 3518–3523. https://doi.org/10.1161/STROKEAHA.111.625491
Key Variable Names:
Depression
Fatalism
Db
Age_less_eq54
Age_55to64
Age_65to74
Age_ge75
No Spice: Recreate my plot from slide 16, but using the stroke dataset with Fatalism as the independent variable, Depression as the dependent variable and Db (diabetes status) as the grouping variable.
Weak Sauce: Add prediction intervals to the plot above.
Medium Spice: No menu options today.
Yoga Flame: No menu options today.
Dim Mak: Similar to the example on slide 7, create a forest plot that shows the effect of Fatalism on Depression score in different age groups. You can find age group variables already in the dataset.
ggplot2 package and geom_point along with the geom_segment function to make your plot.Landré, T., Karaboué, A., Buchwald, Z. S., Innominato, P. F., Qian, D. C., Assié, J. B., Chouaïd, C., Lévi, F., & Duchemann, B. (2024). Effect of immunotherapy-infusion time of day on survival of patients with advanced cancers: a study-level meta-analysis. ESMO open, 9(2), 102220. https://doi.org/10.1016/j.esmoop.2023.102220
Kebede, Mihiretu M. “Plotting Regression Model Coefficients in a Forest Plot.” Medium, 10 June 2021.
Morgenstern, L. B., Sánchez, B. N., Skolarus, L. E., Garcia, N., Risser, J. M., Wing, J. J., Smith, M. A., Zahuranec, D. B., & Lisabeth, L. D. (2011). Fatalism, optimism, spirituality, depressive symptoms, and stroke outcome: a population-based analysis. Stroke, 42(12), 3518–3523. https://doi.org/10.1161/STROKEAHA.111.625491